
feat: Deepseek migration to Megatron-Bridge + CP support #1059

Merged
terrykong merged 30 commits into main from yifu/mbridge_dsv3_mcoretot on Sep 11, 2025

Conversation

@yfw (Contributor) commented Sep 4, 2025

What does this PR do ?

This PR migrates DeepSeek support to Megatron-Bridge. It also adds CP (context parallelism) support for DeepSeek by updating to the latest mcore (taken from #1042).

Closes #1043
Closes #821

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this
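The usage placeholder above was left unfilled. As a purely hypothetical sketch of what enabling the new features might look like: the key names below are inferred from the megatron_cfg fields discussed later in this thread (moe_permute_fusion and the *_model_parallel_size knobs appear in the reviewed configs; context_parallel_size is an assumption by analogy, so verify against the actual config schema before use).

```yaml
policy:
  megatron_cfg:
    enabled: true
    tensor_model_parallel_size: 4
    pipeline_model_parallel_size: 4
    expert_model_parallel_size: 4
    # CP support added by this PR; key name assumed by analogy with the other sizes.
    context_parallel_size: 2
    # New global flag added by this PR (default: false).
    moe_permute_fusion: false
```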

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

Summary by CodeRabbit

  • New Features

    • Added a global config flag to toggle MoE permutation fusion (default: false) across examples and tooling.
    • Enabled trusted remote-code support for model and tokenizer loading.
  • Chores

    • Updated third‑party submodules to newer revisions.
    • Added filelock to pinned dependencies.
  • Tests

    • Updated unit test configs to include the new MoE permutation fusion flag.

yfw added 8 commits September 2, 2025 18:29
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com> (×8, one per commit)
@yfw yfw requested review from terrykong and yaoyu-33 September 4, 2025 00:33
@yfw yfw changed the title feat: CP support for deepseek feat: Support deepseek with Megatron-Bridge + CP support Sep 4, 2025
@yfw yfw changed the title feat: Support deepseek with Megatron-Bridge + CP support feat: Deepseek migration to Megatron-Bridge + CP support Sep 4, 2025
@terrykong (Collaborator) commented:

Since we're moving Megatron ahead, can you run all the Megatron nightly tests? I can share the baselines from the bridge PR for comparison

yfw added 3 commits September 3, 2025 23:35
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com> (×3, one per commit)
yfw and others added 11 commits September 4, 2025 13:03
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com> (×7, one per commit)
logprob_mb_tokens should be an optional seq packing parameter (not needed in SFT)

first attempt to test run times

more timeout fixes

remove train_mv_tokens/logprob_mb_tokens since it's not needed in the init and causes crashes

Signed-off-by: Terry Kong <terryk@nvidia.com>

fix gemma27b parallel config

Signed-off-by: Terry Kong <terryk@nvidia.com>

revert steps and time since that hides regression

Signed-off-by: Terry Kong <terryk@nvidia.com>

revert other configs that hide regression

Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
github-actions bot commented Sep 6, 2025

✅ Submodule Fast-Forward Check Results

Check based on commit: d83ba06 (PR #1059 from yifu/mbridge_dsv3_mcoretot)

✅ Submodules that are properly updated:

Megatron-Bridge: ✅ PR branch is ahead of main branch (fast-forward)
Megatron-LM: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
github-actions bot commented Sep 9, 2025: same submodule fast-forward check passed for commit f2e21fc.

@coderabbitai (bot) left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (8)
nemo_rl/models/megatron/common.py (6)

36-44: Return annotation does not match the actual return (5-tuple)

Function returns five values but type hints specify four; also second element type is missing.

Apply:

-def _pack_sequences_for_megatron(
-    input_ids: torch.Tensor,
-    seq_lengths: torch.Tensor,
-    pad_individual_seqs_to_multiple_of: int = 1,
-    pad_packed_seq_to: Optional[int] = None,
-    cp_rank: int = 0,
-    cp_size: int = 1,
-) -> tuple[torch.Tensor, PackedSeqParams, torch.Tensor, Optional[torch.Tensor]]:
+def _pack_sequences_for_megatron(
+    input_ids: torch.Tensor,
+    seq_lengths: torch.Tensor,
+    pad_individual_seqs_to_multiple_of: int = 1,
+    pad_packed_seq_to: Optional[int] = None,
+    cp_rank: int = 0,
+    cp_size: int = 1,
+) -> tuple[torch.Tensor, torch.Tensor, PackedSeqParams, torch.Tensor, Optional[torch.Tensor]]:

53-60: Docstring “Returns” order is wrong

Docstring says first is packed_input_ids, but code returns all_input_ids first. Fix to prevent misuse.

Apply:

-        Tuple of:
-        - packed_input_ids: Packed input tensor [1, T]
-        - input_ids_cp_sharded: Sharded input tensor [cp_size, T // cp_size]
+        Tuple of:
+        - all_input_ids: Concatenated (unsharded) input tensor [1, T_total]
+        - input_ids_cp_sharded: CP-sharded packed input tensor for this rank [1, T_shard]
         - packed_seq_params: PackedSeqParams object
         - cu_seqlens: Cumulative sequence lengths
         - cu_seqlens_padded: Padded cumulative sequence lengths

72-91: CP padding invariant not enforced (can break CP semantics)

When cp_size > 1, each sequence should be padded to multiple of cp_size*2 (per your Notes), but pad_factor is driven solely by pad_individual_seqs_to_multiple_of. This can cause misalignment between pack/unpack and TE kernels.

Apply:

-    pad_factor = pad_individual_seqs_to_multiple_of
+    pad_factor = pad_individual_seqs_to_multiple_of
+    if cp_size > 1 and pad_factor <= 1:
+        # Ensure per-seq padding meets CP requirements for causal attention load balancing
+        pad_factor = cp_size * 2

And consider documenting that callers can override via pad_individual_seqs_to_multiple_of.
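The invariant the comment describes can be shown as a standalone sketch. The function names below are illustrative only, not the actual helpers in nemo_rl/models/megatron/common.py:

```python
def cp_pad_factor(pad_individual_seqs_to_multiple_of: int, cp_size: int) -> int:
    """Per-sequence pad factor; CP causal-attention load balancing needs multiples of cp_size * 2."""
    pad_factor = pad_individual_seqs_to_multiple_of
    if cp_size > 1 and pad_factor <= 1:
        # Enforce the CP requirement when the caller did not request stricter padding.
        pad_factor = cp_size * 2
    return pad_factor


def round_up(seq_len: int, pad_factor: int) -> int:
    """Round a sequence length up to the next multiple of pad_factor (ceiling division)."""
    return -(-seq_len // pad_factor) * pad_factor
```

With cp_size=4 and no caller-requested padding, every sequence would be padded to a multiple of 8, so a length-10 sequence becomes length 16.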


121-129: Last-sequence padding can go negative

When pad_packed_seq_to < running_seq_len, padded_seq_len becomes negative and corrupts running_seq_len. Add a check.

Apply:

             if b == batch_size - 1 and pad_packed_seq_to is not None:
-                padded_seq_len = pad_packed_seq_to - running_seq_len
+                extra = pad_packed_seq_to - running_seq_len
+                if extra < 0:
+                    raise ValueError(
+                        f"pad_packed_seq_to ({pad_packed_seq_to}) smaller than accumulated length ({running_seq_len})"
+                    )
+                padded_seq_len = extra

355-361: Guard against temperature=0

Division by zero possible if config sets temperature: 0.

Apply:

-            output_tensor.div_(policy_cfg["generation"]["temperature"])
+            temp = policy_cfg["generation"].get("temperature", 1.0)
+            if temp and temp != 1.0:
+                output_tensor.div_(temp)

422-446: broadcast_tensor uses global ranks with group operations

dist.broadcast(_object_list|tensor) expects src to be the rank within the given group. Here, src_rank is documented and used as a global rank, which can target the wrong process.

Minimal fix: convert global src to group src before broadcasts, or change the API to accept src_group_rank. Example conversion:

-    if rank == src_rank:
+    group_rank = dist.get_rank(group)
+    # Convert global to group rank if supported; otherwise, require group-local rank
+    try:
+        src_group_rank = dist.get_group_rank(group, src_rank)  # PyTorch 2.3+
+    except Exception:
+        src_group_rank = src_rank  # Fallback: assume caller passed group-local rank
+
+    if group_rank == src_group_rank:
         if tensor is None:
             raise ValueError(f"Rank {rank} is source ({src_rank}) but tensor is None.")
@@
-        dist.broadcast_object_list(object_list, src=src_rank, group=group)
+        dist.broadcast_object_list(object_list, src=src_group_rank, group=group)
@@
-    dist.broadcast(tensor, src=src_rank, group=group)
+    dist.broadcast(tensor, src=src_group_rank, group=group)

Also update the docstring to clarify src is group-local.
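The global-vs-group distinction can be modeled without torch.distributed: a process group is an ordered list of global ranks, and the group-local rank is the index into that list. A toy model of the mapping (not the real dist.get_group_rank, whose signature takes a ProcessGroup):

```python
def global_to_group_rank(group_global_ranks: list[int], src_global_rank: int) -> int:
    """Map a global rank to its group-local rank (its index in the group's member list)."""
    try:
        return group_global_ranks.index(src_global_rank)
    except ValueError:
        raise ValueError(
            f"Global rank {src_global_rank} is not a member of group {group_global_ranks}"
        ) from None
```

For a group built from global ranks [2, 5, 7], global rank 5 is group rank 1; passing 5 as `src` to a group-scoped collective that expects group-local ranks would instead target global rank 7.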

tools/refit_verifier.py (1)

357-363: Fix tokenizer call: padding_side is not a valid encode kwarg

Set tokenizer.padding_side before calling, instead of passing as an argument.

Apply:

-    tokenized = tokenizer(
-        [prompt],
-        padding=True,
-        truncation=True,
-        return_tensors="pt",
-        padding_side="right",
-    )
+    tokenizer.padding_side = "right"
+    tokenized = tokenizer(
+        [prompt],
+        padding=True,
+        truncation=True,
+        return_tensors="pt",
+    )
nemo_rl/models/policy/megatron_policy_worker.py (1)

772-779: Gate trust_remote_code for tokenizer and AutoBridge

Same concern: make this configurable rather than always True.

Apply:

-        self.megatron_tokenizer = build_tokenizer(
+        self.megatron_tokenizer = build_tokenizer(
             tokenizer_config,
             make_vocab_size_divisible_by=self.megatron_cfg.model.make_vocab_size_divisible_by
             // self.cfg["megatron_cfg"]["tensor_model_parallel_size"],
             tensor_model_parallel_size=self.cfg["megatron_cfg"][
                 "tensor_model_parallel_size"
             ],
-            trust_remote_code=True,
+            trust_remote_code=self.cfg.get("trust_remote_code", False),
         )
@@
-        self.megatron_bridge = AutoBridge.from_hf_pretrained(
-            hf_model_name, trust_remote_code=True
-        )
+        self.megatron_bridge = AutoBridge.from_hf_pretrained(
+            hf_model_name, trust_remote_code=self.cfg.get("trust_remote_code", False)
+        )
🧹 Nitpick comments (14)
examples/configs/recipes/llm/dpo-llama3.1-8b-instruct-4n8g-megatrontp2pp2-quick.yaml (2)

65-66: Nit: normalize boolean casing.

This file mixes True/False and true/false. Consider standardizing (e.g., all lower-case) for consistency across configs.


48-76: Optional: centralize repeated MoE defaults.

Given this flag is added across many configs, consider a shared partial (e.g., examples/configs/_partials/megatron_moe_defaults.yaml) and reference via defaults to reduce drift.

examples/configs/grpo_math_1B_megatron.yaml (2)

90-95: Optional: clarify comment scope and doc.

The “~20% perf speedup with sequence packing” comment sits above apply_rope_fusion. If that claim doesn’t pertain to moe_permute_fusion, consider a brief note for moe_permute_fusion (e.g., when to enable, known hardware/TE constraints).


74-136: Optional: deduplicate Megatron MoE block across recipes.

Same suggestion as in the other file—extract shared megatron_cfg MoE defaults to a partial to keep one source of truth (freeze_moe_router, router dtype/type, moe_permute_fusion, etc.).

3rdparty/Megatron-LM-workspace/Megatron-LM (1)

1-1: Prefer pinning to a stable tag or documenting a lockfile mapping; add a compat note.

Current pin is a raw SHA from a feature branch lineage. Consider:

  • Pinning to an annotated tag (if available) or
  • Keeping the SHA but recording it in a third_party.lock with a short “compat matrix” (Megatron-LM SHA ↔ Megatron-Bridge SHA ↔ TE version) to reduce drift and ease future rebases.
examples/configs/dpo.yaml (1)

112-112: Add a short doc comment and keep style consistent for booleans.

  • Please add a one-liner above to explain when to enable moe_permute_fusion (HW/TE/Megatron-Bridge requirements, known limitations).
  • Minor: booleans in this block mix True/true/false. Consider normalizing within megatron_cfg for readability.

Proposed tweak:

-    moe_permute_fusion: false
+    # Enables fused token<->expert permutation kernels for MoE (requires recent Megatron-Bridge/TE).
+    moe_permute_fusion: false
examples/configs/recipes/llm/grpo-qwen3-30ba3b-8n8g-megatron.yaml (1)

76-76: Confirm interaction with freeze_moe_router=true and EP=4.

With expert_model_parallel_size: 4 and freeze_moe_router: true, fused permute should be safe but offers limited benefit if routing is static. Keep false by default unless nightly perf shows gains; add a note inline for when to flip.

Suggested comment:

-    moe_permute_fusion: false
+    # Set true if kernels are available; verify with GRPO nightly (EP=4, TP=4, PP=4).
+    moe_permute_fusion: false
examples/configs/recipes/llm/grpo-llama3.1-8b-instruct-1n8g-megatron-fp8.yaml (1)

74-74: FP8 path: gate fused permute on TE 2.6+ only.

This recipe targets FP8 vLLM training. Please ensure fused permute kernels are compatible with your TE version; otherwise leaving false is correct. Add a guard note to avoid accidental enablement.

-    moe_permute_fusion: false
+    # FP8: enable only with TE >= 2.6 and validated kernels.
+    moe_permute_fusion: false
examples/configs/grpo_math_1B.yaml (1)

77-77: Consistency + docs: add a brief note and align naming in docs/tools.

  • Add a brief inline comment on the flag purpose.
  • Ensure tools/refit_verifier.py and README mention moe_permute_fusion to keep configs, tooling, and docs in sync.
-    moe_permute_fusion: false
+    # Toggle fused MoE permutation kernels (perf). Default off for safety.
+    moe_permute_fusion: false
nemo_rl/models/megatron/common.py (1)

425-475: Device handling assumes CUDA; add CPU-safe fallback

torch.cuda.current_device() will fail on CPU-only runs. Use the source tensor’s device from metadata, or default to CPU when CUDA is unavailable.

Apply:

-    device = torch.cuda.current_device()
+    device = torch.device("cuda", torch.cuda.current_device()) if torch.cuda.is_available() else torch.device("cpu")

And prefer allocating tensors on the received device encoded in metadata.

tools/refit_verifier.py (3)

158-161: Align Megatron vs vLLM max_new_tokens

Megatron config sets max_new_tokens to args.max_sequence_length while vLLM uses args.max_new_tokens. This can skew comparisons. Recommend using the same source.

Apply:

-            "max_new_tokens": args.max_sequence_length,
+            "max_new_tokens": args.max_new_tokens,

279-281: Use boolean for enforce_eager

Minor: value is a string; prefer a boolean.

Apply:

-            "enforce_eager": "False",
+            "enforce_eager": False,

424-431: Avoid dumping large structures to stdout

The raw print of megatron_input_data and vllm_logprobs_data can be huge. Consider gating under a --verbose flag.

Also applies to: 425-431

nemo_rl/models/policy/megatron_policy_worker.py (1)

1169-1177: Prefer tokenizer ids over hard-coded zeros

Pass actual pad/eos token IDs to get_ltor_masks_and_position_ids for correctness across models.

Apply:

-                attention_mask, _, position_ids = get_ltor_masks_and_position_ids(
-                    data=input_ids,
-                    eod_token=0,  # used for loss_mask, which we don't use
-                    pad_token=0,  # used for loss_mask, which we don't use
+                attention_mask, _, position_ids = get_ltor_masks_and_position_ids(
+                    data=input_ids,
+                    eod_token=self.tokenizer.eos_token_id or 0,
+                    pad_token=self.tokenizer.pad_token_id or 0,
                     reset_position_ids=False,
                     reset_attention_mask=False,
                     eod_mask_loss=False,
                     pad_mask_loss=False,
                 )
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1c85276 and f2e21fc.

📒 Files selected for processing (24)
  • .gitmodules (1 hunks)
  • 3rdparty/Megatron-Bridge-workspace/Megatron-Bridge (1 hunks)
  • 3rdparty/Megatron-LM-workspace/Megatron-LM (1 hunks)
  • examples/configs/dpo.yaml (1 hunks)
  • examples/configs/grpo_math_1B.yaml (1 hunks)
  • examples/configs/grpo_math_1B_megatron.yaml (1 hunks)
  • examples/configs/recipes/llm/dpo-llama3.1-8b-instruct-4n8g-megatron.v2.yaml (1 hunks)
  • examples/configs/recipes/llm/dpo-llama3.1-8b-instruct-4n8g-megatrontp2pp2-quick.yaml (1 hunks)
  • examples/configs/recipes/llm/grpo-llama3.1-8b-instruct-1n8g-megatron-fp8.yaml (1 hunks)
  • examples/configs/recipes/llm/grpo-llama3.2-1b-instruct-1n8g-megatron.yaml (1 hunks)
  • examples/configs/recipes/llm/grpo-math-qwen3-30ba3b-megatron-tp4-32k.yaml (1 hunks)
  • examples/configs/recipes/llm/grpo-moonlight-16ba3b-4n8g-megatron.yaml (1 hunks)
  • examples/configs/recipes/llm/grpo-qwen2.5-7b-instruct-4n8g-megatron.yaml (1 hunks)
  • examples/configs/recipes/llm/grpo-qwen3-30ba3b-8n8g-megatron.yaml (1 hunks)
  • examples/configs/recipes/llm/sft-llama3.1-70b-8n8g-tp4pp2-long-megatron.yaml (1 hunks)
  • examples/configs/recipes/llm/sft-llama3.1-8b-1n8g-megatron-seqpack.yaml (1 hunks)
  • examples/configs/recipes/llm/sft-llama3.1-8b-1n8g-megatron.yaml (1 hunks)
  • examples/configs/sft.yaml (1 hunks)
  • examples/configs/sft_openmathinstruct2_megatron.yaml (1 hunks)
  • nemo_rl/models/megatron/common.py (1 hunks)
  • nemo_rl/models/policy/megatron_policy_worker.py (5 hunks)
  • tests/unit/models/generation/test_vllm_generation.py (1 hunks)
  • tests/unit/models/policy/test_megatron_worker.py (1 hunks)
  • tools/refit_verifier.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Lint check
🔇 Additional comments (21)
examples/configs/recipes/llm/dpo-llama3.1-8b-instruct-4n8g-megatrontp2pp2-quick.yaml (1)

65-65: Confirm dense-path compatibility
Wiring of moe_permute_fusion is implemented in nemo_rl/models/policy/megatron_policy_worker.py (line 566). Please run E2E tests to ensure dense (non-MoE) code paths tolerate this flag without side effects.

examples/configs/grpo_math_1B_megatron.yaml (1)

92-92: Resolve moe_permute_fusion flag presence
The moe_permute_fusion: false line is already present (line 92), matching other configs and placed correctly after moe_router_bias_update_rate. No further changes needed.

examples/configs/recipes/llm/sft-llama3.1-8b-1n8g-megatron.yaml (1)

55-55: Verify moe_permute_fusion plumbing with sequence_parallel=False and pipeline_model_parallel_size=2. Cannot confirm the flag is actually consumed by the worker in this topology—please ensure it’s passed through and that the corresponding kernels exist; if not, keep it false and document.

3rdparty/Megatron-Bridge-workspace/Megatron-Bridge (1)

1-1: Megatron-Bridge submodule bump verified—next steps

  • Verified that Megatron-Bridge is pinned to b99e1ca4e45d648fd8e2bffd63f5a41ede04c3da and Megatron-LM to 383d1144c3b3f77096c63b7308402a0ea6ba47dd, and .gitmodules correctly tracks the yifu/nemo-rl-use-chunkpatch-ds branch as shallow.
  • Manually confirm that this Megatron-Bridge commit remains API-compatible with the Megatron-LM bump.
  • Once PR 440 merges upstream, re-point .gitmodules to a stable branch/tag and pin to a released commit for reproducibility.
  • Ensure CI runs git submodule sync && git submodule update --init --recursive so nightlies and CP jobs use the exact pointers.
  • Add a brief note to RELEASE_NOTES or README explaining the temporary branch tracking and the plan for moving to an upstream tag.
examples/configs/recipes/llm/grpo-llama3.2-1b-instruct-1n8g-megatron.yaml (1)

59-59: Adds moe_permute_fusion flag — OK

Flag wiring looks consistent with other configs and worker code. No issues.

examples/configs/sft.yaml (1)

89-89: Adds moe_permute_fusion flag — OK

Matches the repo-wide pattern; safe even with megatron_cfg.enabled: false.

examples/configs/recipes/llm/dpo-llama3.1-8b-instruct-4n8g-megatron.v2.yaml (1)

65-65: Adds moe_permute_fusion flag — OK

Placement and default align with other configs.

examples/configs/recipes/llm/grpo-math-qwen3-30ba3b-megatron-tp4-32k.yaml (1)

80-81: Adds moe_permute_fusion flag — OK

Consistent with test and tooling updates.

examples/configs/recipes/llm/grpo-qwen2.5-7b-instruct-4n8g-megatron.yaml (1)

61-61: Adds moe_permute_fusion flag — OK

No concerns.

examples/configs/recipes/llm/sft-llama3.1-70b-8n8g-tp4pp2-long-megatron.yaml (1)

45-45: Adds moe_permute_fusion flag — OK

Keeps MoE options consistent across large-model recipes.

.gitmodules (1)

4-4: Submodule branches point to feature branches — please confirm CI reproducibility

Tracking yuya/nemo-rl-use-2 and yifu/nemo-rl-use-chunkpatch-ds can drift. Ensure:

  • Submodule SHAs in the superproject are pinned to the tested commits.
  • Nightly/baseline runs are captured for these exact SHAs.

Would you like a small script to print current submodule SHAs and compare against the bridge PR baselines?

Also applies to: 9-9

nemo_rl/models/megatron/common.py (2)

94-99: Potential mismatch: cu_seqlens_padded[-1] forced without validating total padded length

If pad_packed_seq_to < already-accumulated padded total, cu_seqlens_padded will become inconsistent with tensors.

Apply a guard:

-        if pad_packed_seq_to is not None:
-            cu_seqlens_padded[-1] = pad_packed_seq_to
+        if pad_packed_seq_to is not None:
+            if pad_packed_seq_to < cu_seqlens_padded[-1]:
+                raise ValueError(
+                    f"pad_packed_seq_to ({pad_packed_seq_to}) < total padded length ({int(cu_seqlens_padded[-1])})"
+                )
+            cu_seqlens_padded[-1] = pad_packed_seq_to

229-247: Unpack path ignores CP when cu_seqlens_padded is None

For CP>1 with no per-seq padding, cu_seqlens_padded can be None, and this branch treats it as “no CP,” yet output_tensor is CP-sharded. This will mis-index.

Either:

  • Always set cu_seqlens_padded=cu_seqlens.clone() when CP>1, or
  • Gate unpacking on CP explicitly (use group/world CP metadata), or
  • Document that unpack requires per-seq padding when CP>1.
    Would you like a patch to always clone cu_seqlens when cp_size>1?
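For readers unfamiliar with the packed-sequence bookkeeping discussed above: cu_seqlens is the cumulative-offset vector marking sequence boundaries inside a packed [1, T] tensor. A minimal illustration (not the library code, which builds these as tensors for PackedSeqParams):

```python
def cu_seqlens(seq_lengths: list[int]) -> list[int]:
    """Cumulative sequence lengths: offsets [0, l0, l0+l1, ...] marking packed boundaries."""
    offsets = [0]
    for length in seq_lengths:
        offsets.append(offsets[-1] + length)
    return offsets
```

Sequence i then occupies the half-open slice [offsets[i], offsets[i+1]) of the packed tensor, which is why pack and unpack must agree on whether the padded or unpadded variant is used.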
tools/refit_verifier.py (1)

213-215: Add of moe_permute_fusion looks good

Flag is plumbed consistently with worker usage. No concerns.

examples/configs/recipes/llm/grpo-moonlight-16ba3b-4n8g-megatron.yaml (1)

88-91: Add of moe_permute_fusion is consistent

Matches worker/config plumbing elsewhere. Good default to false.

tests/unit/models/policy/test_megatron_worker.py (1)

98-101: Test config updated with moe_permute_fusion

Looks correct and keeps tests aligned with runtime config.

examples/configs/recipes/llm/sft-llama3.1-8b-1n8g-megatron-seqpack.yaml (1)

55-58: Add of moe_permute_fusion is fine

Consistent with other YAMLs and worker mapping.

examples/configs/sft_openmathinstruct2_megatron.yaml (1)

85-88: Add of moe_permute_fusion is fine

Keeps configs uniform.

tests/unit/models/generation/test_vllm_generation.py (1)

168-171: Add of moe_permute_fusion in Megatron test config

Matches policy worker expectations. Looks good.

nemo_rl/models/policy/megatron_policy_worker.py (2)

64-66: FSDP adapter import path change — verify submodule compatibility

Ensure the updated Megatron-LM/Megatron-Bridge pointers include megatron.core.distributed.fsdp.mcore_fsdp_adapter and CI runs Megatron nightlies per reviewer request.

Would you run the Megatron nightly suite and the RL unit tests on this PR branch to confirm the import path exists across all test matrices?


566-567: moe_permute_fusion is correctly wired into model_cfg

Mapping from cfg.megatron_cfg to model_cfg is clear. No concerns.

Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
github-actions bot: submodule fast-forward check passed for commit 66bd252.

Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
github-actions bot: submodule fast-forward check passed for commit 32eaed7.

Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
github-actions bot: submodule fast-forward check passed for commit 26f4f4d.

Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
github-actions bot: submodule fast-forward check passed for commit b1847d1.

@coderabbitai (bot) left a comment

Actionable comments posted: 1

🧹 Nitpick comments (3)
3rdparty/Megatron-Bridge-workspace/setup.py (3)

48-48: Don’t ship megatron-core’s dev extras in runtime deps.

Pulling dev drags test/build tooling into user installs and increases resolver churn. Keep mlm only; expose dev via a dedicated extras or docs.

-    "megatron-core[dev,mlm]>=0.14.0a0,<0.16.0",
+    "megatron-core[mlm]>=0.14.0a0,<0.16.0",

If the submodule’s pyproject currently includes [dev,mlm], update it in tandem so the consistency check still passes.


16-17: Guard toml parsing for Python <3.11 and avoid hard fail when parser isn’t available.

tomllib is 3.11+. On older Pythons this will crash before we can even decide to skip the check. Add a tomli fallback and skip the consistency check if neither is present.

-import tomllib
+try:
+    import tomllib  # py311+
+    _HAVE_TOMLLIB = True
+except ModuleNotFoundError:
+    try:
+        import tomli as tomllib  # py<=3.10 fallback
+        _HAVE_TOMLLIB = True
+    except ModuleNotFoundError:
+        _HAVE_TOMLLIB = False
@@
-if os.path.exists(bridge_src_dir):
+if os.path.exists(bridge_src_dir) and _HAVE_TOMLLIB:
@@
-    with open(pyproject_path, "rb") as f:
-        data = tomllib.load(f)
+    with open(pyproject_path, "rb") as f:
+        data = tomllib.load(f)
@@
-    else:
+    else:
         print(
             "[megatron-bridge][setup] Dependency sets are consistent with the submodule pyproject.",
             file=sys.stderr,
         )
+elif os.path.exists(bridge_src_dir):
+    print(
+        "[megatron-bridge][setup] tomllib/tomli not available; skipping dependency consistency check.",
+        file=sys.stderr,
+    )

If you prefer enforcing the check, add a conditional dep instead: tomli>=2.0.1; python_version<'3.11' to both this file and the submodule pyproject so the set comparison still matches.

Also applies to: 55-65, 72-101


52-53: Pin a minimal filelock version (and ensure submodule pyproject matches).

Unpinned filelock can resolve very old releases; set a floor for reproducibility and to ensure modern APIs.

-    "filelock",
+    "filelock>=3.12",

Confirm the same constraint exists in Megatron-Bridge/pyproject.toml to keep the consistency check passing.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 26f4f4d and b1847d1.

📒 Files selected for processing (2)
  • 3rdparty/Megatron-Bridge-workspace/Megatron-Bridge (1 hunks)
  • 3rdparty/Megatron-Bridge-workspace/setup.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • 3rdparty/Megatron-Bridge-workspace/Megatron-Bridge
🔇 Additional comments (1)
3rdparty/Megatron-Bridge-workspace/setup.py (1)

47-49: Verify megatron-core version range
The install_requires pin in setup.py (megatron-core[dev,mlm]>=0.14.0a0,<0.16.0) must cover the APIs introduced after commit d07e4e5—if your changes rely on 0.15.x functionality, tighten it to >=0.15.0,<0.16.0.

Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
github-actions bot: submodule fast-forward check passed for commit 2999f2b.

@terrykong terrykong enabled auto-merge September 10, 2025 17:43
@terrykong terrykong added this pull request to the merge queue Sep 10, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to no response for status checks Sep 10, 2025
@terrykong terrykong added this pull request to the merge queue Sep 11, 2025
Merged via the queue into main with commit 16d9128 Sep 11, 2025
26 checks passed
@terrykong terrykong deleted the yifu/mbridge_dsv3_mcoretot branch September 11, 2025 04:10
@coderabbitai coderabbitai bot mentioned this pull request Sep 14, 2025
4 tasks
guyueh1 pushed a commit to guyueh1/NeMo-RL that referenced this pull request Sep 15, 2025
…#1059)

Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
Co-authored-by: Ahmad Kiswani <kiswani.ahmad@gmail.com>
Co-authored-by: Terry Kong <terryk@nvidia.com>
PrinsYin pushed a commit to PrinsYin/RL that referenced this pull request Nov 30, 2025
…#1059)

Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
Co-authored-by: Ahmad Kiswani <kiswani.ahmad@gmail.com>
Co-authored-by: Terry Kong <terryk@nvidia.com>

Development

Successfully merging this pull request may close these issues.

  • Deepseek v3 migration to Megatron-Bridge
  • Add DeepSeek-V3 CP support

3 participants